arXiv:1507.08788vl [cs.LG] 31Jul2015 


Fast Stochastic Algorithms for SVD and PCA: 
Convergence Properties and Convexity 

Ohad Shamir 

Weizmann Institute of Science 
ohad.shamir@weizmann.ac.il 


Abstract 

We study the convergence properties of the VR-PCA algorithm introduced by ifT^ for fast compu¬ 
tation of leading singular vectors. We prove several new results, including a formal analysis of a block 
version of the algorithm, and convergence from random initialization. We also make a few observa¬ 
tions of independent interest, such as how pre-initializing with just a single exact power iteration can 
significantly improve the runtime of stochastic methods, and what are the convexity and non-convexity 
properties of the underlying optimization problem. 


1 Introduction 


We consider the problem of recovering the top k left singular vectors of a d x n matrix X = (xi,..., x„), 
where k d. This is equivalent to recovering the top k eigenvectors of XX~^, or equivalently, solving the 
optimization problem 


min —W~^ 



( 1 ) 


This is one of the most fundamental matrix computation problems, and has numerous uses (such as low-rank 
matrix approximation and principal component analysis). 

For large-scale matrices X, where exact eigendecomposition is infeasible, standard deterministic ap¬ 
proaches are based on power iterations or variants thereof (e.g. the Lanczos method) |8]. Alternatively, 
one can exploit the structure of Eq. ([T]l and apply stochastic iterative algorithms, where in each iteration 
we update a current d x k matrix W based on one or more randomly-drawn columns Xj of X. Such algo¬ 
rithms have been known for several decades ( ifldl flTll '). and enjoyed renewed interest in recent years, e.g. 
EnaanoiEi. Another stochastic approach is based on random projections, e.g. Il9ll20l. 

Unfortunately, each of these algorithms suffer from a different disadvantage: The deterministic algo¬ 
rithms are accurate (runtime logarithmic in the required accuracy e, under an eigengap condition), but re¬ 
quire a full pass over the matrix for each iteration, and in the worst-case many such passes would be required 
(polynomial in the eigengap). On the other hand, each iteration of the stochastic algorithms is cheap, and 
their number is independent of the size of the matrix, but on the flip side, their noisy stochastic nature means 
they are not suitable for obtaining a high-accuracy solution (the runtime scales polynomially with e). 

Recently, ifT^ proposed a new practical algorithm, VR-PCA, for solving Eq. Q, which has a “best-of- 
both-worlds” property: The algorithm is based on cheap stochastic iterations, yet the algorithm’s runtime is 
logarithmic in the required accuracy e. More precisely, for the case k = 1, x, of bounded norm, and when 
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there is an eigengap of A between the first and second leading eigenvalues of the covariance matrix ^XX~^, 
the required runtime was shown to be on the order of 

^ log • (2) 

The algorithm is therefore suitable for obtaining high accuracy solutions (the dependence on e is logarith¬ 
mic), but essentially at the cost of only 0{\og{l/e)) passes over the data. The algorithm is based on a recent 
variance-reduction technique designed to speed up stochastic algorithms for convex optimization problems 
(US), although the optimization problem in Eq. © is inherently non-convex. See Section for a more 
detailed description of this algorithm, and lfT9ll for more discussions as well as empirical results. 

The results and analysis in ifTOll left several issues open. For example, it is not clear if the quadratic 
dependence on 1/A in Eq. © is necessary, since it is worse than the linear (or better) dependence that 
can be obtained with the deterministic algorithms mentioned earlier, as well as analogous results that can 
be obtained with similar techniques for convex optimization problems (where A is the strong convexity 
parameter). Also, the analysis was only shown for the case k = 1, whereas often in practice, we may want 
to recover k > 1 singular vectors simultaneously. Although ||T^ proposed a variant of the algorithm for that 
case, and studied it empirically, no analysis was provided. Finally, the convergence guarantee assumed that 
the algorithm is initialized from a point closer to the optimum than what is attained with standard random 
initialization. Although one can use some other, existing stochastic algorithm to do this “warm-start”, no 
end-to-end analysis of the algorithm, starting from random initialization, was provided. 

In this paper, we study these and related questions, and make the following contributions: 

• We propose a variant of VR-PCA to handle the k > 1 case, and formally analyze its convergence 
(Section©. The extension to A: > 1 is non-trivial, and requires tracking the evolution of the subspace 
spanned by the current solution at each iteration. 

• In Section© we study the convergence of VR-PCA starting from a random initialization. And show 
that with a slightly smarter initialization - essentially, random initialization followed by a single power 
iteration - the convergence results can be substantially improved. In fact, a similar initialization 
scheme should assist in the convergence of other stochastic algorithms for this problem, as long as a 
single power iteration can be performed. 

• In Section© we study whether functions similar to Eq. © have hidden convexity properties, which 
would allow applying existing convex optimization tools as-is, and improve the required runtime. For 
the k = 1 case, we show that this is in fact true: Close enough to the optimum, and on a suitably- 
designed convex set, such a function is indeed A-strongly convex. Unfortunately, the distance from 
the optimum has to be C)(A), and this precludes a better runtime in most practical regimes. However, 
it still indicates that a better runtime and dependence on A should be possible. 


2 Some Preliminaries and Notation 

We consider ad x n matrix X composed of n columns (xi,..., x„), and let 

1 1 ” 

A= -XX^ = - 

i=l 

Thus, Eq. © is equivalent to finding the k leading eigenvectors of A. 
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We generally use bold-face letters to denote vectors, and capital letters to denote matrices. We let Tr(-) 
denote the trace of a matrix, || • ||f to denote the Frobenius norm, and || • ||sp to denote the spectral norm. 
A symmetric d x d matrix B is positive semidefinite, if inf^g^d Bz >0. A is positive definite if the 
inequality is strict. Following standard notation, we write i? A 0 to denote that A is positive semidefinite, 
and B'^C if B — C'^0. B>-0 means that B is positive definite. 

A twice-differentiable function F on a subset of is convex, if its Hessian is alway positive semidef¬ 
inite. If it is always positive definite, and XI for some A > 0, we say that the function is A-strongly 
convex. If the Hessian is always -< si for some s > 0, then the function is s-smooth. 

3 The VR-PCA Algorithm and a Block Version 

We begin by recalling the algorithm of lIT^ for the k = 1 case (Algorithm [^, and then discuss its general¬ 
ization for A: > 1. 


Algorithm 1 VR-PCA: Vector version (A: = 1) 

1 

Parameters: Step size rj, 

epoch length m 

2 

Input: Data matrix X = 

(xi,..., x„); Initial unit vector wq 

3 

for s = 1, 2,... do 


4 

~ 1 / T - 

^ = n Si=l X* (X^' W, 

-i) 

5 

Wo = Ws -1 


6 

for f = 1, 2,..., m do 


7 

Pick it G {1,.. ., n} 

uniformly at random 

8 

= wt_i -h r/ (xj, 

(x^wt.i - + u) 

9 

Wt - ||^,||W' 


10 

end for 


11 

Ws = Wm 


12 

end for 



The basic idea of the algorithm is to perform stochastic updates using randomly-sampled columns Xj of 
the matrix, but interlace them with occasional exact power iterations, and use that to gradually reduce the 
variance of the stochastic updates. Specifically, fhe algorithm is split into epochs s = 1,2,..., where in 
each epoch we do a single exact power iteration with respect to the matrix A (by computing u), and then 
perform m stochastic updates, which can be re-written as 

w[ = (I-h 7yA)wt_i-h r/- a) (wt_i - w^_i) , = —*— 

^ ^ W'^tW 

The first term is essentially a power iteration (with a finite step size if), whereas the second term is zero- 
mean, and with variance dominated by ||wt_i — Ws_i|p. As the algorithm progresses, Wi_i and 
both converge toward the same optimal point, hence ||wt_i — Ws_i|p shrinks, eventually leading to an 
exponential convergence rate. 

To handle the A: > 1 case (where more than one eigenvector should be recovered), one simple technique 
is deflation, where we recover the leading eigenvectors vi, V 2 ,..., one-by-one, each time using the 
A; = 1 algorithm. However, a disadvantage of this approach is that it requires a positive eigengap between 
all top k eigenvalues, otherwise the algorithm is not guaranteed to converge. Thus, an algorithm which 
simultaneously recovers all k leading eigenvectors is preferable. 
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We will study a block version of Algorithm [TJ presented as Algorithm It is mostly a straightfor¬ 
ward generalization (similar to how power iterations are generalized to orthogonal iterations), where the 
d-dimensional vectors wj_i, Ws_i, u are replaced by d x k matrices Wt-i, Wg-i, U, and normalization is 
replaced by orthogonalizatiorQ Indeed, Algorithm [^is equivalent to Algorithm when k = 1. The main 
twist in Algorithm is that instead of using Wg-i, U as-is, we perform a unitary transformation (via the 
k X k orthogonal matrix Bt-\) which maximally aligns them with Wt-i- Note that Bt-i is a fc x fc matrix, 
and since k is assumed to be small, this does not introduce significant computational overhead. 


Algorithm 2 VR-PCA: Block version 

Parameters: Rank k, Step size r/, epoch length m 

Input: Data matrix X = (xi,..., x„); Initial d x k matrix Wq with orthonormal 
columns 

for s = 1, 2,... do 

Wo = 

for f = 1, 2,..., m do 

Bt-i = VU~^, where USV~^ is an SVD decomposition of Wj_iWs-i 
\> Equivalent to Bt-i = argmin^T^^/ ||kFt_i — Ws-iB\\‘^p 
Pick it G {1,..., n} uniformly at random 

Wl = Wt-i + V (x*, - xlWs-iBt-i) + UBt-i) 

1 /2 

wt = Wl (wi^wiy 

end for 

Ws = W,n 

end for 


We now turn to provide a formal analysis of Algorithm which directly generalizes the analysis of 
Algorithm [T] given in IT9l : 

Theorem 1. Define the d x d matrix A as ^ XjX^, and let I 4 denote the d x k matrix 

composed of the eigenvectors corresponding to the largest k eigenvalues. Suppose that 

• max, ||xj|p < r for some r > 0. 

• A has eigenvalues si > S 2 > ■ ■ ■ > Sd, where Sk — Sfc+i = A/or some A > 0. 

• k-\\VjJWo\\l<l. 

Let 5,€ € (0,1) be fixed. If we run the algorithm with any epoch length parameter m and step size rj, such 
that ^ 

p < , m > ^ ^ kmp^r"^ + rksJmp'^ log(2/5) < f (3) 

pX 

*The normalization Wt = Wl (^Wt^Wl'j ensures that Wt has orthonormal columns. We note that in our analysis, rj is 
chosen sufficiently small so that Wt^Wl is always invertible, hence the operation is well-defined. 
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(where c, c', d' designate certain positive numerical constants), and for T 
probability at least 1 — [log2(l/e)](5, it holds that 

k-\\v;jWT\\l < e. 


log(l/«s) 

log(2/<5) 


epochs, then with 


For any orthogonal W, k — \\V^W\\\ lies between 0 and k, and equals 0 when the column spaces of 
14 and W are the same (i.e., when W spans the k leading singular vectors). According to the theorem, 
taking appropriate]^ ry = 0(A/(/cr)^), and m = Q{{rk/X)‘^), the algorithm converges with high probability 
to a high-accuracy approximation of V^. Moreover, the runtime of each epoch of the algorithm equals 
0{mdk^ + dnk). Overall, we get the following corollary: 

Corollary 1, Under the conditions of Theorem there exists an algorithm returning Wt such that k — 
\\V ^< e with arbitrary constant accuracy, in runtime O ^dk{n + =Si)iog(iA)). 

This runtime bound is the sambas that of llT9ll for /c = 1. 

The proof of Theorem [T] appears in Subsection |6.1[ and relies on a careful tracking of the evolution 
of the potential function k — An important challenge compared to the k = 1 case is that the 

matrices Wt-i and Wg-i do not necessarily become closer over time, so the variance-reduction intuition 
discussed earlier no longer applies. However, the column space of Wt-i and Wg-i do become closer, and 
this is utilized by introducing the transformation matrix Bt-i. We note that although Bt-i appears essential 
for our analysis, it isn’t clear that using it is necessary in practice: In |[T^ . the suggested block algorithm 
was Algorithm]^ with Bt-i = I, which seemed to work well in experiments. In any case, using this matrix 
doesn’t affect the overall runtime beyond constants, since the additional runtime of computing and using 
this matrix (0{dk‘^)) is the same as the other computations performed at each iteration. 

A limitation of the theorem above is the assumption that the initial point Wq is such that A:—11 Wq 11 < 
i. This is a non-trivial assumption, since if we initialize the algorithm from a random d x 0(1) orthogo¬ 
nal matrix Wq, then with overwhelming probability, ||V)J 1^0111' = 0{l/d). However, experimentally the 
algorithm seems to work well even with random initialization |[T9ll . Moreover, if we are interested in a 
theoretical guarantee, one simple solution is to warm-start the algorithm with a purely stochastic algorithm 
for this problem (such as ll6l[T0l|4]|), with runtime guarantees on getting such a Wq. The idea is that Wq is 
only required to approximate I 4 up to constant accuracy, so purely stochastic algorithms (which are good in 
obtaining a low-accuracy solution) are quite suitable. In the next section, we further delve into these issues, 
and show that in our setting such algorithms in fact can be substantially improved. 


4 Warm-Start and the Power of a Power Iteration 


In this section, we study the runtime required to compute a starting point satisfying the conditions of Theo- 
rem[T| starting from a random initialization. Combined with Theorem]^ this gives us an end-to-end analysis 
of the runtime required to find an e-accurate solution, starting from a random point. For simplicity, we will 
only discuss the case k = 1, i.e. where our goal is to compute the single leading eigenvector vi, although 


^Specifically, we can take m = c' \og{2/5)/r]\ and rj = aS^/r^X, where a is sufficiently small to ensure that the first and third 
condition in Eq. 1 3 1 holds. It can be verified that it’s enough to take a = min |c, ( k iog(2/6) ) | • 

'Gil showed that it’s possible to further improve the runtime for sparse X, replacing d by the average column sparsity ds- This 
is done by maintaining parameters in an implicit form, but it’s not clear how to implement a similar trick in the block version, where 
fe > 1. 
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our observations can be generalized to /c > 1. In the k = 1 case, Theorem [T] kicks in once we find a vector 
w satisfying (vi, -w )2 > 1. 

As mentioned previously, one way to get such a w is to run a purely stochastic algorithm, which com¬ 
putes the leading eigenvector of a covariance matrix E[xx^] given a stream of i.i.d. samples x. We can 
easily use such an algorithm in our setting, by sampling columns from our matrix X = (xi,..., x„) uni¬ 
formly at random, and feed to such a stochastic optimization algorithm, guaranteed to approximate the 
leading eigenvector of ^ ^*^7- 

To the best of our knowledge, the existing iteration complexity guarantees for such algorithms (assuming 
the norm constraint r < 1 for simplicity) scale at leasj^as d/\^. Since the runtime of each iteration is 0{d), 
we get an overall runtime of C>((d/A)2). 

The dependence on d in the iteration bound stems from the fact that with a random initial unit vector 
wo^ we have (vi, w^o)^ ~ Thus, we begin with a vector almost orthogonal to the leading eigenvector vi 
(depending on d). In a purely stochastic setting, where only noisy information is available, this necessitates 
conservative updates at first, and in all the analyses we are aware of, the number of iterations appear to 
necessarily scale at least linearly with d. 

However, it turns out that in our setting, with a finite matrix X, we can perform a smarter initialization: 
Sample w from the standard Gaussian distribution on M'^, perform a single power iteration w.r.t. the covari¬ 
ance matrix A = -XX~^, i.e. wq = ^w/HA-wH, and initialize from wq. For such a procedure, we have the 
following simple observation: 

Lemma 1. For wq as above, it holds for any 5 that with probability at least 1 — ^ — <5, 

\2 

° “ 12 log(d) nranA:(A) ’ 

II All^ 

where nrank{A) = is the numerical rank of A. 

The numerical rank (see e.g. lIT^ l is a relaxation of the standard notion of rank: For any d x d matrix 
A, nrank(A) is at most the rank of A (which in turn is at most d). However, it will be small even if A 
is just close to being low-rank. In many if not most machine learning applications, we are interested in 
matrices which tend to be approximately low-rank, in which case nrank(A) is much smaller than d or even 
a constant. Therefore, by a single power iteration, we get an initial point -wq for which (vi, wq)^ is on the 
order of l/nrank(A), which can be much larger than the l/d given by a random initialization, and is never 
substantially worse. 


Proof of Lemma^I^ Let si > S 2 > • • • > Sd > 0 be the d eigenvalues of A, with eigenvectors vi,..., v^. 
We have 


(vij’Wo)^ 


(vi, Aw)^ 

II A-w||2 


(si(vi, w))" 


s?(vi, w) 




Since w is distributed according to a standard Gaussian distribution, which is rotationally symmetric, we 
can assume without loss of generality that vi,..., correspond to the standard basis vectors oi,..., e^, 
in which case the above reduces to 


2,„2 


StW 


lu/i 


E d 2 2 

*=i Si < 


> 




E d 
i=l 


^2 maxj wf ’ 


"'For example, this holds for (3, although the bound only guarantees the existence of some iteration which produces the desired 
output. The guarantee of (4) scale as and the guarantee of iflOl scales as d/A® in our setting. 
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where wi,... ,Wd are independent and scalar random variables with a standard Gaussian distribution. 

First, we note that sf equals ||^||sp, the spectral norm of A, whereas Yli=i equals ||^|||', the Probe¬ 
s' \\A\\^ 1 

nius norm of A. Therefore, v=r^ = , ' .y = —and we get overall that 

II^IIP nrank(A)’ & 


(vi,Wo)^ > 


1 wf 

nrank(A) max* wf 


(4) 


We consider the random quantity wf/ maxj wf, and independently bound the deviation probability of 
the numerator and denominator. First, for any f > 0 we have 


Pr(mf < f) = Pr(mi G [-Vf, Vt]) = y ^ 

Second, by combining two standard Gaussian concentration results (namely, that if FF = max{ | tui |,..., | |}, 

then 0 < E[FF] < 2^2 log{d), and by the Cirelson-Ibragimov-Sudakov inequality, Pr(FF — E[FF] > t) < 
exp(—f^/2)), we get that 

Pr(max|u;j| > 2y/2\og(d) + t)< exp(—f^/2), 

i 

and therefore 

Pr(maxt(;f > (2-^2log(d) +f)^) < exp(—f/2). (6) 

i 

Combining EqJ^ and Eq. Q, with a union bound, we get that for any fi, f 2 > 0, it holds with probability 
at least 1 — y f G — exp(—f|/2) that 

^^1 >_it_ 

maxjm^ - (2.^2 log{d) + f 2 )^' 


To slightly simplify this for readability, we take t 2 = y/2 log(d), and substitute S = y f G- This implies 
that with probability at least 1 — 6 — 1/d, 

wf 6^ 

- - — > — - ->- . 

meiKiwf 181og(d) 121og((i) 

Plugging back into Eq. Q, the result follows. □ 

This result can be plugged into the existing analyses of purely stochastic PCA/SVD algorithms, and 
can often improve the dependence on the d factor in the iteration complexity bounds to a dependence on 
the numerical rank of A. We again emphasize that this is applicable in a situation where we can actually 
perform a power iteration, and not in a purely stochastic setting where we only have access to an i.i.d. data 
stream (nevertheless, it would be interesting to explore whether this idea can be utilized in such a streaming 
setting as well). 

To give a concrete example of this, we provide a convergence analysis of the VR-PCA algorithm (Algo¬ 
rithm!^, starting from an arbitrary initial point, bounding the total number of stochastic iterations required 
by the algorithm in order to produce a point satisfying the conditions of Theorem [T] (from which point the 
analysis of Theorem [T] takes over). Combined with Theorem [TJ this analysis also justifies that VR-PCA 
indeed converges starting from a random initialization. 
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Theorem 2. Using the notation of Theorem^(where A is the eigengap, vi is the leading eigenvector, and 
r = maxi ||xi|pj, and for any 6 G (0, g), suppose we run Algorithm^^with some initial unit-norm vector 
wq such that 

(vi, wo)^ > C > 0, 


and a step size p satisfying 


P < 




log^(2/5) 

(for some universal constant c). Then with probability at least 1 — 5, after 

d log(2/5) 


(V) 


T = 


pXC 


stochastic iterations (lines 6 — 
wt satisfying 1 — (vi, < 

then 


10 in the pseudocode, where d is again a universal constant), we get a point 
2 - Moreover, if p is chosen on the same order as the upper bound in Eq. A7v, 


T = 


/ log^(2/(5)\ 

V < 52 ^ 2^4 J ■ 


Note that the analysis does not depend on the choice of the epoch size m, and does not use the special 
structure of VR-PCA (in fact, the technique we use is applicable to any algorithm which takes stochastic 
gradient steps to solve this type of problerr0. The proof of the theorem appears in Section 

Considering (5, r as a constants, we get that the runtime required by VR-PCA to find a point w such that 
1 — (vi, ^ 2 is where ^ is a lower bound on (vi,wo)^. As discussed earlier, if wq is 

a result of random initialization followed by a power iteration (requiring 0{nd) time), and the covariance 
matrix A has small numerical rank, then ^ = (vi, wq)^ = 0(1/ log((i)), and the runtime is 


6.2 


O 





By Corollary [T] the runtime required by VR-PCA from that point to get an e-accurate solution is 

o(<i(n+l,)log(l)), 


SO the sum of the two expressions (which is d (n -|- ^) up to log-factors), represents the total runtime 
required by the algorithm. 

Finally, we note that this bound holds under the reasonable assumption that the numeric rank of A 
is constant. If this assumption doesn’t hold, (/ can be as large as d, and the resulting bound will have 
a worse polynomial dependence on d. We suspect that this is due to a looseness in the dependence on 
(/ = (vi,wo)2 in Theorem]^ since better dependencies can be obtained, at least for slightly different 
algorithmic approaches (e.g. l|4l[T0l|6l). We leave a sharpening of the bound w.r.t. Q as an open problem. 

^Although there exist previous analyses of such algorithms in the literature, they unfortunately do not quite apply to our algo¬ 
rithm, for various technical reasons. 










5 Convexity and Non-Convexity of the Rayleigh Quotient 

As mentioned in the introduction, an intriguing open question is whether the d[n + log (^) runtime 
guarantees from the previous sections can be further improved. Although a linear dependence on d, n seems 
unavoidable, this is not the case for the quadratic dependence on 1/A. Indeed, when using deterministic 
methods such as power iterations or the Lanczos method, the dependence on A in the runtime is only 1 / A or 
even y^l/A ||T5ll . In the world of convex optimization from which our algorithmic techniques are derived, 
the analog of A is the strong convexity parameter of the function, and again, it is possible to get a dependence 
of 1/A, or even y^l/A with accelerated schemes (see e.g. ffTSlfT^ ITl in the context of the variance-reduction 
technique we use). Is it possible to get such a dependence for our problem as well? 

Another question is whether the non-convex problem that we are tackling (Eq. Q) is really that non- 
convex. Clearly, it has a nice structure (since we can solve the problem in polynomial time), but perhaps it 
actually has hidden convexity properties, at least close enough to the optimal points? We note that Eq. Q 
can be “trivially” convexified, by re-casting it as an equivalent semidefinite program fSl. However, that 
would require optimization over d x d matrices, leading to poor runtime and memory requirements. The 
question here is whether we have any convexity with respect to the original optimization problem over 
“thin” d X k matrices. 

In fact, the two questions of improved runtime and convexity are closely related: If we can show that the 
optimization problem is convex in some domain containing an optimal point, then we may be able to use 
fast stochastic algorithms designed for convex optimization problems, inheriting their good guarantees. 

To discuss these questions, we will focus on the k = 1 case for simplicity (i.e., our goal is to find a 
leading eigenvector of the matrix A = ^XX~^ = ^ ^*^7)> study potential convexity properties 

of the negative Rayleigh quotient. 


Fa{^) = - 


w''~Aw 


w I 


1 

-E 

n ^ 

2=1 


(w,Xi 


w 


Note that for A: = 1, this function coincides with Eq. ([T]l on the unit Euclidean sphere, and with the same 
optimal points, but has the nice property of being defined on fhe enfire Euclidean space (fhus, af leasf ifs 
domain is convex). 

Af a firsf glance, such funcfions Fa appear fo pofenfially be convex af some bounded disfance from an 
opfimum, as illusfrafed for insfance in fhe case where A= ^ 


0 0 


(see Eigure 


1 1 . Unforlunalely, if furns 


ouf fhaf fhe figure is misleading, and in facf fhe funclion is not convex almost everywhere: 


Theorem 3. For the matrix A above, the Hessian of Fa is not positive semidefinite for all but a measure-zero 
set. 

2 

Proof The leading eigenvector of A is vi = (1,0), and Fa{w) =- ^ ■ The Hessian of this function 

at some w equals 

2 / W2{SWi — W 2 ) —2wiW2{wf — W 2 ) \ 

{wj + \ - 2 wiW 2 {wf - wl) wf{wf-3w2) )' 


9 









Figure 1: The function {wi,W 2 ) ^ , 

corresponding to Fa{v^) where A = (1 0 ; 0 0). 
It is invariant to re-scaling of w, and attains a 
minimum at (a, 0) for any a ^ 0. 



Figure 2: Illustration of the construction of the 
convex set on which Fa is strongly convex and 
smooth. vi is the leading eigenvector of A, and a 
minimum of Fa (as well as any re-scaling of vi). 
Wo is a nearby unit vector, and we consider the 
intersection of a hyperplane orthogonal to wq, 
and an Euclidean ball centered at wq. 


The determinant of this 2x2 matrix equals 

4 


. 2 , 2^6 {'wfw'liSwl - w1){wl - 3mi) - Awfwiiwl - wlf) 

[wf -h W2F 

= ( 2^^2n6 - 3 ^ 2 ) - - wlf) 

+ W 2 F 


Awfwl 
{wf + wlf 


{-{wl + wlf) = - 


Awfwl 

{wi+wir 


which is always non-positive, and strictly negative for w for which W 1 W 2 7 ^ 0 (which holds for all but a 
measure-zero set of M'^). Since the determinant of a positive semidefinite matrix is always non-negative, this 
implies that the Hessian isn’t positive semidefinite for any such w. □ 


The theorem implies that we indeed cannot use convex optimization tools as-is on the function Fa, even 
if we’re close to an optimum. However, the non-convexity was shown for Fa as a function over the entire 
Euclidean space, so the result does not preclude the possibility of having convexity on a more constrained, 
lower-dimensional set. In fact, this is what we are going to do next: We will show that if we are given some 
point Wq close enough to an optimum, then we can explicitly construct a simple convex set, such that 


• The set includes an optimal point of Fa- 

• The function Fa is C)(l)-smooth and A-strongly convex in that set. 
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This means that we can potentially use a two-stage approach: First, we use some existing algorithm (such 
as VR-PCA) to find wq, and then switch to a convex optimization algorithm designed to handle functions 
with a finite sum structure (such as Fa)- Since the runtime of such algorithms scale better than VR-PCA, in 
terms of the dependence on A, we can hope for an overall runtime improvement. 

Unfortunately, this has a catch: To make it work, we need to have wq very close to the optimum - in 
fact, we require || vi — wo|| < 0{X), and we show (in Theorem|^ that such a dependence on the eigengap A 
cannot be avoided (perhaps up to a small polynomial factor). The issue is that the runtime to get such a wq, 
using stochastic-based approaches we are aware of, would scale at least quadratically with 1/A, but getting 
dependence better than quadratic was our problem to begin with. For example, the runtime guarantee using 
VR-PCA to get such a point wq (even if we start from a good point as specified in Theorem [T]) is on fhe 
order of 

<i(n+T)log(l), 

whereas the best known guarantees on getting an e-optimal solution for A-strongly convex and smooth 
functions (see [Tj) is on the order of 

<!(»+/f)log(l). 

Therefore, the total runtime we can hope for would be on the order of 



In comparison, the runtime guarantee of using just VR-PCA to get an e-accurate solution is on the order of 

^ + 1(2 ) log (^) • (9) 

Unfortunately, Eq. Q is the same as Eq. ([^ up to log-factors, and the difference is not significant unless 
the required accuracy e is extremely small (exponentially small in n, 1/A). Therefore, our construction is 
mostly of theoretical interest. However, it still shows that asymptotically, as e —)• 0, it is indeed possible 
to have runtime scaling better than Eq. 0. This might hint that designing practical algorithms, with better 
runtime guarantees for our problem, may indeed be possible. 

To explain our construction, we need to consider two convex sets: Given a unit vector wq, define the 
hyperplane tangent to wq, 

= {w : (w,wo) = 1 } 

as well as a Euclidean ball of radius r centered at wq: 

B^voir) = {w : ||w - wo|| < r} 

The convex set we use, given such a wq, is simply the intersection of the two, H-^o H Hwq (j’), where r is a 
sufficiently small number (see Eigure0. 

The following theorem shows that if wq is O (A)-close to an optimal point (a leading eigenvector vi of 
A), and we choose the radius of (r) appropriately, then n (r) contains an optimal point, and 
the function Fa is indeed A-strongly convex and smooth on that set. Eor simplicity, we will assume that A 
is scaled to have spectral norm of 1, but the result can be easily generalized. 
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Theorem 4. For any positive semidefinite A with spectral norm 1, eigengap A and a leading eigenvector 
vi, and any unit vector wq such that ||wo — vi|| < the function Fa{w) is 20-smooth and \-strongly 
convex on the convex set i/wo ^ ^wq (^)> which contains a global optimum of Fa- 


The proof of the theorem appears in Subsection 6.3 Finally, we show below that a polynomial depen¬ 
dence on the eigengap A is unavoidable, in the sense that the convexity property is lost if w^o is significantly 
further away from vi. 


Theorem 5. For any A,e G (O, there exists a positive semidefinite matrix A with spectral norm 1, 
eigengap A, and leading eigenvector vi, as well as a unit vector wo/or which || vi — wq II ^ -\/2(l + e)A), 
such that Fa is not convex in any neighborhood ofwQ on iiwo- 

Proof Let 

/I 0 0\ 

A = \ 0 1 - A 0 , 

Vo 0 0 / 

for which vi = (1, 0, 0), and take 

wo = (y/l -p2,0,p), 

where p = y^( 1 + e)A (which ensures ||vi—wo|p = s/2p‘^ = y^ 2(1 + e)A). Consider the ray {(y^l — p‘^,t,p) : 
t > 0}, and note that it starts from wq and lies in iiwo- The function Fa along that ray (considering it as a 
function of t) is of the form 


(1-p^) + (1 - A)f^ _ l-p^ + (l-A)f^ 

(1—p^) + f2+p2 1 + 


The second derivative with respect to t equals 

(3t2-i)(A-p2) ^ (3f^-l)6A 

(f2 + 1)3 (t2 + 1)3 ’ 


where we plugged in the definition of p. This is a negative quantity for any t < ^. Therefore, the function 
Fa is strictly concave (and not convex) along the ray we have defined and close enough to wq, and therefore 
isn’t convex in any neighborhood of wq on iiwo- D 


6 Proofs 


6.1 Proof of Theorem [T] 

Although the proof structure generally mimics the proof of Theorem 1 in ifT^ for the k = 1 special case, 
it is more intricate and requires several new technical tools. To streamline the presentation of the proof, we 


begin with proving a series of auxiliary lemmas in Subsection 6.1.1 and then move to the main proof in 


Subsection 6.1 The main proof itself is divided into several steps, each constituting one or more lemmas. 

Throughout the proof, we use the well-known facts that for all matrices B, C, D of suitable dimensions, 
Tr(B + C) = Tv{B) + Tr(C), Tt{BC) = Tv{CB), Tt:{BCD) = Tt{DBC), and Tr{B^B) = \\B 
Moreover, since Tr is a linear operation, E[Tr(ii)] = E[Tr(ii)] for a random matrix B. 


2 

F- 
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6.1.1 Auxiliary Lemmas 

Lemma 2. For any B,C,D F 0, it holds that Ti{BC) > Tr(i?(C' — D)) and Tt{BC) > Tr{{B — D)C). 

Proof. It is enough to prove that for any positive semidefinite matrices E, G, it holds that Tr(£'G) > 0. The 
lemma follows hy taking either E = B,G = D {in which case, Tt{BG) = Tr{B{G — D)) + Ft{BD) > 
Tr{B{G-D))),oxE = D,G = C (in which case, Ti'(5C') = Tt{{B-D)G)+Tt{DG) > Tt{{B-D)G)). 
Any positive semidefinite matrix M can be written as the product for some symmetric matrix 

(known as the matrix square root of M). Therefore, 

Tr{EG) = Tr(^i/2^^/2(^i/2^i/2^ ^ rj,j.(^i/2^i/2^i/2(^i/2) 

= Tr((£’^/2(^l/2)T('^l/2(^l/2)^ ^ ||^1/2(^1/2||2^ > q_ 


□ 


Lemma 3 . If B F 0 and C F 0, then 


Tv{BG-^) >Ti{B{2I -G)), 


where I is the identity matrix. 


Proof. We begin by proving the one-dimensional case, where B, G are scalars 6 > 0, c > 0. The inequality 
then becomes bc~^ > 6(2 — c), which is equivalent to 1 > c(2 — c), or upon rearranging, (c — 1)^ > 0, 
which trivially holds. 

Turning to the general case, we note that by Lemma|^ it is enough to prove that G~^ — {21 — G) F Q. 
To prove this, we make a couple of observations. The positive definite matrix G (like any positive definite 
matrix) has a singular value decomposition which can be written as USU~^, where U is an orthogonal matrix, 
and S' is a diagonal matrix with positive entries. Its inverse is US~^U~^, and 21 — G = 21 — USU~^ = 
U{21 -S)U^. Therefore, 

G-^ - {21 -G) = US-^U^ - U{21 - S)U^ = (7(5"^ - {21 - S))U^. 

To show this matrix is positive semidefinite, it is enough to show that each diagonal entry of S“^ — {21 — S) 
is non-negative. But this reduces to the one-dimensional result we already proved, when 6=1 and c > 0 is 
any diagonal entry in S. Therefore, G~^ — {21 — G) F 0, from which the result follows. □ 

Lemma 4. For any matrices B, G, 

Tt{BG) < ||i?||F||C||i. 

and 

\\BG\\f < ||f?|Up||C7||i.. 

Proof. The first inequality is immediate from Cauchy-Shwartz. As to the second inequality, letting c* denote 
the i-th column of G, and || • ||2 the Euclidean norm for vectors. 


\\BG\\f 




B\\sp\\C\\f. 


□ 
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Lemma 5. Let Bi, B 2 , Zi, Z 2 be k x k square matrices, where Bi, B 2 are fixed and Zi, Z 2 are stochastic 
and zero-mean (i.e. their expectation is the all-zeros matrix). Furthermore, suppose that for some fixed 
a,'y,5 > 0, it holds with probability 1 that 

• For all V G [0,1], B 2 + VZ 2 ^ bl. 

• max{||Zi||F, ||^2 ||f} < a- 


• \\Bi + rjZiWsp < 7. 


Then 


E [Tr {{B, + Zi){B 2 + ^ 2 )"^)] > Tr{BiBf^) 


0^(1 + j/S) 


Proof. Define the function 

f{v) = Tr {{Bi + vZi){B 2 + uZ 2 )-^) , i/ G [0,1], 

Since B 2 + i'Z 2 is positive definite, it is always invertible, hence /(u) is indeed well-defined. Moreover, it 
can be differentiated with respect to u, and we have 

f'{u) = Tr {Zi{B 2 + uZ 2 )-^ - {Bi + uZi){B 2 + vZ 2 )-^Z 2 {B 2 + uZ 2 )-^) . 

Again differentiating with respect to v, we have 


f''{v) = Ti(^-2Zi{B2 + vZ2)-^Z2{B2 + uZ2)-^ 

-\-2{Bi 1 'Zi){B2 UZ 2 ) ^Z2{B2 + 1'Z2) ^Z2{B2FuZ2) 

—Zi + (Bi + iyZi)(B2 + i'Z2) ^Z'^{B2 + i'Z2) ^^ 2(^2 + -^ 2 ) 


= 2Tr 


Using Lemmaj^and the triangle inequality, this is at most 


2|| — .Zi + (i?i + i/E'i)(i?2 T ^^-^ 2 ) ^-^211^11 (.^2 + ^'^ 2 ) "'■Z2(i?2 T ^^-^ 2 ) ^||f 


-1 / 




<2[\\Zi\\f+\\{Bi + vZi)[B2 + uZ2)-^Z2\\f) \\{B2 + uZ2) 


- 1||2 
I sp 


\\Z2\\l 


— 2 ( II^iIIf + ||.Bi + z^-2’i||sp|| {B 2 + VZ 2 ) 


-1 


|5p| 


||^2||f \\{B2 + vZ2) 


- 1||2 

lisp 


ll^2||j 


1 \ 1 2 a^(l + 7 /( 5 ) 

<2(« + 7^« pa = 


Applying a Taylor expansion to /(•) around u = 0, with a Lagrangian remainder term, and substituting the 
values for f'{v), we can lower bound /(I) as follows: 

/(I) > /(o) + /'(O) * (1 - 0) - ^ max|/"(zy)| * (1 - 0)^ 

2 b' 

= TV {BiBf^) + Tr [ZiBf^ - BiBf^Z 2 Bf^) - 
Taking expectation over Zi, Z 2 , and recalling they are zero-mean, we get that 

E[/(l)] > Tr [B^Bf^) - 

Since E[/(l)] = E [Tr {{Bi + Zi){B 2 + ^ 2 )“^)], the result in the lemma follows. □ 
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Lemma 6. Let Ui,... ,Uk and Ri , i ?2 be positive semidefinite matrices, such that R 2 — Ri ^ 0, and define 
the function 


-IN 


f{xi ...Xk)=Tr E XiUi + Ri E XiUi + R 2 


\i=l 


\i=l 


over all (xi ... x^) G [a, fiY for some /3 > a > 0. Then /(x) = f{a, 

Proof Taking a partial derivative of / with respeet to some Xj, we have 


, a . 




-1 


-IN 


Tr Uj + 


^i=l 


vi=l 


-IN 


- I + I j;rc.t/, + R2 j f^dE XiUi + i?2 

/ fc \ “^\ / A: 

= Tr 1^/ - C/i + i?i j XiC/i + i ?2 j j Uj XiU, + R 2 

= Tt iii ^ XiUi + i?2^ — i ^ ajjC/j + ^ ^ xfJi + R 2 XiUi + R 2 


-IN 


vi=l 


^*=1 


^ i=l 


\i=l 


-1 


-IN 


Tr |^(i? 2 -i?i) 

IS XiUi + i?2 J Uj XiUi + i?2 J 

By the lemma’s assumptions, eaeh matrix in the produet above is positive semidefinite, henee the produet 

le 

□ 


is positive semidefinite, and the traee is non-negative. Therefore, ^f{x) > 0, whieh implies that the 


funetion is minimized when eaeh Xj takes its smallest possible value, i.e. a. 
Lemma 7. Let B be a k x k matrix with minimal singular value 6. Then 


1 - 


\B\\l 


^ > max\l-\\B\\l , — {k-\\B\\l) 


Proof We have 


1 - 


WB'^Bwi ^ . wBwim^ 


so it remains to prove 1 — 


■p^ 


\B\\l - 

52 


> 1 - 


\B\\l 


= l-\\B\ 




> ^ (fc — Let iTi,..., cjfc denote the veetor of singular values 

of B. The singular values of B^ B are erf,, af, and the Frobenius norm of a matrix equals the Euelidean 
norm of its veetor of singular values. Therefore, the lemma is equivalent to requiring 


i_ELv!f > 


^-E 


2=1 


assuming cjj G [5,1] for all i. This holds sinee 


1 - 


E.<^f _ _ T..diV-4) ^ _ N 


2 

i 






2 
i 




2 
i 


k-Y,^ 


= J 


□ 
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Lemma 8. For any d x k matrices C, D with orthonormal columns, let 

Dc = avg min \\C — DBW'jp 

DB : (DB)^ {DB)=I 

be the nearest orthonormal-columns matrix to C in the column space of D (where B is a kxk matrix). Then 
the matrix B minimizing the above equals B = VU~^, where D = USV~^ is the SVD decomposition of 
D, and it holds that 

\\C-Dc\\l<2{k-\\C'^D\\l). 

Proof Since D has orthonormal columns, we have D = I, so the definition of B is equivalent to 

B = aig min \\C — DB\\%. 

B :bt b=i 

This is the orthogonal Procrustes problem (see e.g. O), and the solution is easily shown to be i? = VU~^ 
where USV~^ is the SVD decomposition of D. In this case, and using the fact that IICIII. = ||-D|||’ = k 
(as C, D have orthonormal columns), we have that \\C — DcW’^p equals 

\\C - DB\\\ = \\Cfp + ||L>||^ -2Tt{C^DB) = 2 (k - Tt{USV^(VU^))^ = 2 (k - Tt{USU^)^ . 

Since the trace function is similarity-invariant, this equals 2k — Tr(5). Let si..., be the diagonal ele¬ 
ments of S, and note that they can be at most 1 (since they are the singular values of D, and both C and 
D have orthonormal columns). Recalling that the Frobenius norm equals the Euclidean norm of the singular 
values, we can therefore upper bound the above as follows: 


2{k- Ti{USU^)^ 


2{k- Tr(5)) 


k-Y^sA <2 fe -= 2ik-\\C'^D 


2 = 1 


2=1 



□ 

Lemma 9. Let Wt, W[ be as defined in Algorithm^ where we assume p < ^. Then for any d x k matrix 
Vk with orthonormal columns, it holds that 


W^WtAp-Wv^Wt-i 


< 


I2kp 
1 - 3p’ 


Proof Letting s^, S(_i denote the vectors of singular values of V)J Wt and V)J Wt-i, and noting that they 
are both in [0,1]^ (as I 4 , Wt-i, Wt all have orthonormal columns), the left hand side of the inequality in the 
lemma statement equals 

\\\StA - ||st_i|p| = (||st||2 -h ||st_i||2) I ||st||2 - ||st_i||2 | < 2 \/fe||St - St-l\\2 < 2 k\\st - St_i||oo, 

where || • ||oo is the infinity norm. By Weyl’s matrix perturbation theorerr|^|[T2l, this is upper bounded by 

2k\\vAWt - vAWt-iWsp < 2k\\Vk\\sp\\Wt - Wt-i\\sp < 2k\\Wt - Wt-i\\sp. (10) 

^Using its version for singular values, which implies that the singular values of matrices B and B + E are different by at most 

WEWs,- 
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Recalling the relationship between Wt and Wt-i from Algorithmwe have that 

Wl = Wt-i + rfN, 


where 


sp ^ 1 lisp “h ll^it^j^ l^s—11 


sp “f 

n 


XjXj lisp < 3, 


i=l 


as Wt-i, Ws-i-, Bt-i all have orthonormal columns, and XjjX^ and ^ X]r=i ^*^7 have spectral norm at 
most 1. Therefore, W/ equals Wt-i, up to a matrix perturbation of spectral norm at most 3r/. Again by 
WeyTs theorem, this implies that the k non-zero singular values of the d x k matrix Wl are different from 
those of Wt-i (which has orthonormal columns) by at most Srj, and hence all lie in [1 — St], 1 + 3??]. As a 

,-V2 


result, the singular values of 


wjwi 


all lie in 


1 1 
1-1-3p’ l-3p 


. Collecting these observations, we have 


/ ,-r \ - 1/2 

IllCt - VCi-ilIsp = \\{Wt-i+vN) (vcJiWi'.i) - Wi_i||sp 

< ||m_i +r?iV (w'j,wuy"^‘' Wsp 

/ ,-r \-l/2 / ,-r \-l/2 

< II [W,J,WU) - I\\sp + r/||A^||sp|| [Wj,WU) lisp 

^ 3r] 3r] Qrj 

~ 1 — 3rj^ 1 — 3rj 1 — 3r/ 

Plugging back to Eq. ([T0|), the result follows. □ 


6.1.2 Main Proof 

To simplify the technical derivations, note that the algorithm remains the same if we divide each Xj by 
and multiply rj by r. Since maxj ||xj|p < r, this corresponds to running the algorithm with step-size r/r 
rather than rj, on a re-scaled dataset of points with squared norm at most 1, and with an eigengap of A/r 
instead of A. Therefore, we can simply analyze the algorithm assuming that max* ||xj|p < 1, and in the end 
plug in A/r instead of A, and rjr instead of tj, to get a result which holds for data with squared norm at most 
r. 


Part I: Establishing a Stochastic Recurrence Relation 


We begin by focusing on a single iteration t of the algorithm, and analyze how || V)J WtW’^p (which measures 
the similarity between the column spaces of 14 and Wt) evolves during that iteration. The key result we 


need is Lemma 10 below, which is specialized for our algorithm in Lemma 11 


Lemma 10. Let A be a d x d symmetric matrix with all eigenvalues si > S 2 >■■■> Sd in [0, 1], and 
suppose that Sk — Sk+i > Xfor some A > 0. 

Let N be a d X k zero-mean random matrix such that || A^||f 4 crfj and || A^||sp < with probability 
1 , and define 


tn = 46 (cr^)^ 





2 
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Let W be a d X k matrix with orthonormal columns, and define 


W'= {I + t]A)W + rjN , W” = 


for some rj G 


0 , 


4max{l,(j^} 

= [vi, V 2 ..., Vfc] is the d x k matrix of A’s first k eigenvectors, then the following holds: 

. E [1 - WV^W'Tf] < (1 - IdXWV^WWl) (1 - \\v^w\\l) + 

• VWk ^IIf 


Eat 


k-\\V^W"\\l 


< [k-\\v;jw\\l) +ri^rM. 


Proof Using the fact that At{BCD) = Tt{CDB) for any matrices B, C, D, we have 


E 


\v^w"\\l 


= E 
= E 

= E 


Ai[w''^VkV;jw' 


Trl iW 'W 


- 1 / 2 , 


W ' VkV^W' (W '^W' 

-1 


TV ( (TU W']{W ^W' 


A-V2 


By definition of W', we have 

W'^VkV^W' = ((/ + 7]A)W + r/N)^ VkV^ ((/ + rjA)W + rjN) 


where we define 


— Bi + Zi, 

Bi = W^{I + r]A)VkViJ (/ + ijA)W + rfN'^VkV^ N 
Zi = riN^VkViJ (/ + r]A)W + r]W^ {I + r]A)VkV^ N. 


Also, we have 


W''W' = ((/ + rjA)W + r]N)^ ((/ + riA)W + riN) 

= B 2 + Z 2 , 


where 


B 2 = W'^{I + r]A){I + r]A)W + 

Z 2 = + t^A)W + r]W^{I + r]A)N. 


( 11 ) 


With these definitions, we can rewrite Eq. (11 1 as E [Tr((i?i + Zi){B 2 + Z 2 ) ^)]. We now wish to remove 
Zi, Z 2 , by applying Lemma]^ To do so, we check the lemma’s conditions: 


Zi, Z 2 are zero mean: This holds since they are linear in N, and N is assumed to be zero-mean. 
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• B 2 + vZ 2 ^ |/ for all v G [0,1]: Recalling the definition of i? 2 , Z 2 , and the facts that ^ ^ 0, 
N~^N y 0 (by construction), and W~^W = I, we have that B 2 ^ I. Moreover, the spectral norm of 
Z 2 is at most 

2r]\\N^{I + vA)W\\sp < 27/||iV|Up||/ + rjA\\sp\\W\\sp < 2r?^7^^(l + 77 ) < 2r,a^il + 77 ), 

which by the assumption on rj is at most 2|(l + j) = |. This implies that the smallest singular value 
of B 2 + iyZ 2 is at least 1 — z^(5/8) >3/8. 

• max{||Zi ||i?, 11 ^ 211 ^} < By definition of Zi, Z 2 , and using Lemmathe Frobenius norm of 

these two matrices is at most 

277||iV||F||(/ + r]A)\\sp\\W\\sp < 277cj^(1 + 77 ), 

which by the assumption on rj is at most 2r]af^ (l + j) = 

• \\Bi + r]Zi lisp < (^<7^ + 2 )^: Using the definition of Bi, Z\ and the assumption 77 < |, 

||i3i + r]Zi\\sp < ||i3i||sp + 77||Zi||sp 

<(1 + 77)2 + 772 ( 77^^)2 + 27777^(1 + 77 ) 


Applying Lemma and plugging back to Eq. ( [TT] ), we get 

E [||'t/7VF"|||] > E [Tl'((Si + Zi){B2 + Za)-^)] 

> Tr {BiBf^) - ^1 + ^ Q(7^ ^ ) 


( 12 ) 


We now turn to lower bound Tr {B 1 B 2 ^), by first re-writing Bi, B 2 in a different form. For i = 1,... ,d, 
let 

u, = iu^v,v7iy, 

where Vj is the eigenvector of A corresponding to the eigenvalue s*. Note that each Ui is positive semidefi- 
nite, and Ui = W^W = I. We have 


Bi = W^{I + r]A)VkV^ (I + 77A)1U + ifN^VuV^ N 
= ((/ + rjA)Vk) ((/ + T/A)Vk)^ W + r/^N^VkVk^N 

k 

= J](l + vsifW^^ri^rJ lU + ri^N^VkV,J N 

i=l 

k 

= Y^{l + r,SifUi + r,^N^VkVk^N. ( 13 ) 

i=l 
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Similarly, 


B 2 = W^{I + r]A){I + 7^A)W + yfN^N 

d 

= ^(1 + rjSifW^VivJ W + 

i=l 

d 

= ^(1 + rjSifUi + rj^N^N. (14) 

i=l 


Plugging Eq. ([T3]) and Eq. ([T4]) back into Eq. ([T^, we get 


E 


WV^W'^Wl] >Tr i^{l + r]Sl)^Ui + r]^N^VkV^^N] (Y,i^ + VSi?Ui + r]^N^N 


K i=l 


■k 2 = 1 


400 f sp , n 


(15) 


Recalling that si > S 2 > ■ ■ ■ > Sk and letting a = (1 + rjSkY,13 = (1 + the trace term can be 

lower bounded by 


min Tr V x^U^ + ifN^VkY^N] V + V (1 + r^SifUi + r^N^N 
a;i,...,XfcS[Q:,/3] * ' ^^ ^^ ^^ 


, i=l 


,i=l i=k+l 


Applying Eemma|^(noting that as required by the lemma, Jjf=k+i(^3-r/Si)^l7i+'ri^N~^ N—r}^N~^VkV/J N 
Yli=k+ii^ + VSi)‘^Ui + rj‘^N~^ (/ — VfcV^J) N A 0), we can lower bound the above by 

Tr [ ("(1 + rjskf + V^N^ykVk^ A f (1 + VSk? + E (^ + 

y V i=l J \ i=l i=k+l J 

Using Eemma]^ this can be lower bounded by 


Tr I ( (1 + rjskf 

i=l ) V i=l i=k+l / 

o \ -1^ 


(E''.+ E (ft^Vc/< + 

' \i=i / \i=i i=k+i ' 


+ VSk 


1 + r]Sk 


^ ' N^N 


Applying Eemmaj^ this is at least 








Recalling that I = Yli=i Ui = Yl\=i + Y^'i=k+i simplified to 


\2=1 i=k+l \ 


V 2=1 


1 + r/Sj 
1 +7?Sfe 


Ui- 


—^— ] N^N 

1 + 7?Sfe, 


( 16 ) 


Since Ui E 0, then using Lemma we can lower bound the expression above by shrinking each of the 
2 — ^ i+^Sfc ) ] terrns- Iii particular, since s* < Sfc — A for each i>k + l. 


2 - 


1 + ysi 
1 + r/Sfc 


> 2 - ^ + > 2 - ^ + ~ ^ 1 


1 + rjSk 


1 + rjSk 


1 + f?Sfc ’ 


which by the assumption that rj <1/4 and Sfc < si < 1, is at least 1 + |r/A. Plugging this back into Eq. (161, 
and recalling that Yli=i •^he lower bound 


-— 1 N^N 

+ VSk 


'T' E*^- Ef'-+ E (‘ + 5’'^)^‘- I 


y2=l / \2=1 i=k-\-l 


\i=l 


2=1 


—^ 1 N^N 

l + vsk. 


Again using Lemma this is at least 


E^'O ' + ;;')M'-E^- 


V 2 = 1 


2 = 1 


V 

1 + VSk 


Tr iV^iV 


y2=l 


> 




^ 2 = 1 
^ k 


2 = 1 
k 


—^^ Tr(N^N) 

l + r]SkJ ^ 


>11' Et^d /+j^A /-Ec'. ■ 


^ 2 = 1 


2 = 1 


Recall that this is a lower bound on the trace term in Eq. ( [T5] ). Plugging it back and slightly simplifying, we 
get 


E 


\V^W"\\l\ >Tr j+ .A 




where 


V 2=1 


r^ = 46 l + ri-af+ 2 


2=1 

2 \ 
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The trace term above can be re-written (using the definition of Ui and the fact that B) = ||-B|||n) as 

Tr ^ v,v7 ) 

= (^1 + ^7?A^ Tr (w^VkV^w) - ^r/ATr ({w^VkV^Jw) {w^VkV^W^^ 


= ( 1 + ^r?A] ||yT^11^ _ ^r^xWW^VkV^W\\l 


= ||F,TfT||Ul + -r?A 1- 


\W^VkV,JW\\l 

^'"V \\v^w\\l 

Applying Lemma and letting 6 denote the minimal singular value of VjJ W, this is lower bounded by 

\\V^W\\l (l + ^r/Amax |l - j {k - \\V;!W\\l) |) . 

Overall, we get that 

E [lln^ W'Wl] > ||F,^ W\\l (l + ^r?Amax |l - ||yT w\\l , y (A: - W\\l) |) - r?V^. (17) 

We now consider two options: 

• Taking the first argument of the max term in Eq. ( [TT] ), we get 

E W\\l (^1 + ^r?A (l - \\V^W\\l)'^ - rfvN. 

Subtracting 1 from both sides and simplifying, we get 


E 


1- ||LJlL"||i 


< (l - ^r?A||Lfc^ W\\l^ (l - \\V^W\\l) + r,\M. 


E 


k IlF - 2 

WYf] > \\V^WWf (l + ^ {k - Wfp)'^ - 


Suppose that WVjJW\\ l>k-\. Taking the second argument of the max term in Eq. dlvL we get 

ArfXd"^ 


Subtracting both sides from k, , we get 


E 


k-\\v;!w"\\l] < (a:-||lTve|| 


Twi|2 \ in^Twl|2 / . iit.^Tti/||2 


< (k-\\v;^w\\i) (i- 


hk 
4r?A(52 
5k 

5k 


\\V^Wfp[k-\\V^WrF]+v'rN 

= (k- ||lT w\\l) (l - W\\l) + 


A; - - ) ] + V rN 
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Since A; > 1, we can lower bound the (A: — term by Moreover, the condition A: — || 14 ^ | 

implies that the singular values iti, ..., of W satisfy k — Yli=i ^ cjj is in [ 0 , 1 ] 

(as 14 , W have orthonormal columns), so no Oi can be less than This implies that 5 > 5 . Plugging 
the lower bounds k-\> I and > 2 into th® above, we get 


E 


k-\\V^W"\\l\ < [k-\\V^W\\l) ( 


1 - 


Lemma 11. Let A, Wt be as defined in Algorithm^^ and suppose that p G 
holds for some positive numerical constants ci, C 2 , c^: 

. ¥.[i-\\v;!w"\\l] < {l-cip\\\V^W\\l) {l-\\V;!W\\l)+C 2 kp^ 

. If\\V;jWt\\l >k-i then 


0 , 


23\/fc 


□ 


Then the following 


E 


k-\\V,'Wt+i\\F < {k-\\V^WtrF){l-cip{X-C2v)) + C3r?4A:-||y,'iy._i||^). 


In the above, the expectation is over the random draw of the index it, conditioned on Wt and Wg-i- 


Proof To apply Lemma 10 we need to compute upper bounds and on the Frobenius and spectral 
norms of N, which in our case equals (xj^x^ — A){Wt — Wg-iBt). Since ||4||sp, ||xjjX^||sp < 1 , and 
Wt, Ws-i,Bt have orthonormal columns, the spectral norm of N is at most 


^pxl-A)iWt-Ws-iBt) 


\sp ^ 


^it I Up “I" 


sp 


sp T II PPs—1 lUp II -Bi II sp j ^ ‘i, 


< 4, 


so we may take = 4. As to the Frobenius norm, using Lemma and a similar calculation, we have 

\l<4\\Wt-Ws-iBt\\l. 


To upper bound this, define 

IV, = arg min \\Wt - VkB\\p 

VkB-.iVkBV{VkB)=I 

to be the nearest orthonormal-columns matrix to Wt in the column space of I4, and 

Wv = arg _ min ||Viy, - Ws-iB\\p 

Ws-iB:{Ws-iBfi{Ws-iB)=I 

to be the nearest orthonormal-columns matrix to IVt i^ column space of Wg-i- Also, recall that by 
definition, 

Ws-iBt = arg min \\Wt - Ws-iB\\p 

Ws-iB-.{Ws-iBfi {Ws-iB)=I 

is the nearest orthonormal-columns matrix to Wt in the column space of Wg-i- Therefore, we must have 
\\Wt — Ws-iBt\W < II Wt — Wylll’- Using this andLemmaj^ we have 

\\Wt-Ws-iBtfp < IIW-Wylll 

= mt-Vw,)-{Wv-VwMl 

< 2|| Wt - VwAl + 2||Wy - VwAl 

= 4 (fc - \\V^ Will) +A(k- IIW_i|||) . 
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By definition of Vwt^ we have Vwt = ^kB where B = B'^V^V^B = (VkB)~^(V^B) = I. Therefore B 
is an orthogonal k x k matrix, and 11 Ws-i 11= 11 Vj! Ws-i Up = 11 VfJ Ws-i 11 1’, so the above equals 
4{k — 1114 ^ IIf) + 4(/c — 1114 ^ Ws-i in’)- Overall, we get that the squared Frobenius norm of N can be 
upper bounded by 


K)2 = 16 ( (fe - 1114 ' Wt^F) + {k- \\V,' Ws-lfF 


Tt 


Plugging (T^ and into the tat as defined in Lemma 10 and picking any r/ G [0, (which satisfies 

, since 4max{l,fT4} < 4max{l, \/l6 * 2k} < 


the condition in Lemma 
23Vk), we get 


10 


that r] G 


0 , 


1 


4max{l,cr^} 


rM = 736 {k - 1114 ' WtWj.) + ik- II 44 ' Ws-iWi 


1 + 


8 /I 


3 V4 


r4 + 2 


< 18400 ({k - llVfc^TLill^) + ik- II 44 ' 4F.-i||;^ 


Ti 


This implies that rjv < 36800A: always, which by application of Lemma[T0| gives the first part of our lemma. 
As to the second part, assuming ||44^ 444|||' > A: — ^ and applying Lemma 10 we get that 


E 


k - II 44 T Wt+iWl] < (fc - II 44 T44^*111) (1 - 


+ 18400 r/" ((A - ||44^44"t||^) + {k - ||44' 144-i||f 


T, 


= (fc-||44'44^i|lF, 


t) (1 - 7? 184007?)) 

+ 18400 rj\k-\\V^Ws-i\\l). 

This corresponds to the lemma statement. 


□ 


Part II: Solving the Recurrence Relation for a Single Epoch 

Since we focus on a single epoch, we drop the subscript from l+s_i and denote it simply as W. 

Suppose that rj = aX, where a is a sufficiently small constant to be chosen later. Also, let 

64 = A; - 1144^444111. and 5 = A: - ||44^4L||1,. 

Then Lemma[^tells us that if a is a sufficiently small constant, ht < then 

^[ht+i\Wt] < {l-caX^)bt + ca^X% (18) 

for some numerical constants c, d. 

Lemma 12. Let B be the event that bt < ^for all t = 0,1,2,, m. Then for certain positive numerical 
constants ci,C2, C3, if a < ci, then 

^[bm\B] < ((1 - C 2 aA^)™' + c^a) b, 

where the expectation is over the randomness in the current epoch. 
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Proof. Recall that bt is a deterministic function of the random variable Wt, which depends in turn on Wt-i 
and the random instance chosen at round t. We assume that Wq (and hence b) are fixed, and consider how 
bt evolves as a function of t. Using Eq. ([T^, we have 


¥.[bt+i\Wt,B]=¥. 


bt+i\Wt, bt+i < - 


< ¥.[bt+i\Wt] < {l-ca\^)bt + c'a^X%. 

,bt, so the event B 


Note that the first equality holds, since conditioned on Wt, bt+i is independent of bi 
is equivalent to just requiring bt+i <1/2. 

Taking expectation over Wt (conditioned on B), we get that 

E[bt+i\B] < E\{l-caX^)bt + c'a‘^XH B 
= (1 - caX^) E [bt\B] + c'a^X^b. 

Unwinding the recursion, and using that bo = b, we therefore get that 

m—1 

E[bm\B] < (l — caA^)"^ b + c'a^X% (l — caX^Y 


i=0 

OO 


< 


(l — caA^)”* b + c'a^X^b ^ (l — caA^)* 

i=0 

(l — caX‘^)"^b + c'a^X^b —^ 
(^l_caX^)^ + -a) b. 


as required. □ 

We now turn to prove that the event B assumed in Lemma [T^ indeed holds with high probability: 

Lemma 13 . The following holds for certain positive numerical constants ci, C2, 03.' If a < ci, then for any 
j3 G (0,1) and m, if 

b + C2kma^X‘^ + c^k^ma^X^ log(l//3) < (19) 

then it holds with probability at least 1 — /3 that 

bt < b + C2kma^X‘^ + coks/ma^X^ log(l//3) < ^ 

for all t = 11 , 1 , 2 ,, m. 

Proof. To prove the lemma, we analyze the stochastic process bo{= b),bi,b2,..., bm, and use a concentra¬ 
tion of measure argument. First, we collect the following facts: 

• b = bo < I : This directly follows from the assumption stated in the lemma. 

• As long as bt < \, E [6t+i|kU] < 6* + 020^X% for some constant C2. Supposing a is sufficiently 
small, then by Eq. ( [T^ , 

E[5t+i|lTt] < (1 — caA^) + c'a^A^5 < bt + ca^X^b. 
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• \ht+i — bt\ is bounded by c'^kaXfor some constant c'^: Applying Lemma]^ and assuming that a is at 
most some sufficiently small constant ci (e.g. a < ^, so rj = aX < 


\bt+i - bt 


V^Wt+i\\l-\\V^Wt\\l 


^ 12kr] ^ 12kaX 

- 1 - 3r] - 3/4 


16kaX. 


Armed with these facts, and using the maximal version of the Hoeffding-Azuma inequality jlTII . it follows 
that with probability at least 1—/3, it holds simultaneously for all f = 1,..., m (and for f = 0 by assumption) 


that 

bt <b + C 2 ma^X% + c^k^JmoP'X^ \og{l/(5) 


for some constants 02,03, as long as the expression above is less than If the expression is indeed less than 
i, then we get that bt <\ for all t. Upper bounding 6 by fc and slightly simplifying, we get the statement in 
the lemma. □ 


Combining Lemma [T^ and Lemma [T3| and using Markov’s inequality, we get the following corollary: 

Lemma 14. Let confidence parameters € (0,1) be fixed. Suppose that m,a are chosen such that 
a < Cl and 

b + C 2 kmQ^X^ + c^k^ma^X^ log(l//3) < 

where oi, 02 , 03 are certain positive numerical constants. Then with probability at least 1 — (/? + 7), it holds 
that 

< — ((1 — oaA^)"* + ca) b. 
for some positive numerical constants 0, fi. 


Part III: Analyzing the Entire Algorithm’s Run 


Given the analysis in Lemma [T4| for a single epoch, we are now ready to prove our theorem. Let 

bs = k-\\v;!ws\\i. 

By assumption, at the beginning of the first epoch, we have bo = k — ||V)JlUo|||^ < Therefore, by 
Lemma 


14 


for any /?, 7 G (O, 2), if we pick any 


1 


a < min ^ oi, ^7 f m > 


^ such that ^ + 02/cma^A^ + 03A:\/mQ;2A2 log(l//3) < 


2 ’ 

( 20 ) 


then we get with probability at least 1 — (/3 + 7) that 


1 / / ox 31og(l/7) 1 ~ 

bm < “ ( “ caA^) + -7^j bo 

Using the inequality (1 — (l/x))“^ < exp(—a), which holds for any x > 1 and any a, and taking x = 
1/(cq;A^) and a = 31og(l/7), we can upper bound the above by 




1 

7 


7 + ^ 7 '^ 


bo < ibo- 
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Since bm equals the starting point bi for the next epoch, we get that bi < 760 < Again applying 


Lemma 14 and performing the same calculation we have that with probability at least 1 — (/3 + 7 ) over the 
next epoch, 62 < 7^1 < 7 ^&o- Repeatedly applying Lemma 14 and using a union bound, we get that after 
T epochs, with probability at least 1 — r(/3 + 7 ), 


k-\\V^WT\\F = bT < f bo < 7 '"'. 


Therefore, for any desired accuracy parameter e, we simply need to use T = 


log(l/e) 

log(l/7) 


epochs, and get 


k — 1114 ^ Ws\W < e with probability at least 1—T(/3 + 7 ) = l — 

<5 


log(l/£) 

log(l/ 7 ) 


(/3 + 7)- 


Using a confidence parameter 6, we pick /3 = 7 = |, which ensures that the accuracy bound above 


holds with probability at least 

’log(l/e) 


1 - 


log(2/(5) 


5 > l- 


' log(l/e) 

log( 2 ) 


5 = 1- 


log2 ( - 


5. 


Substituting this choice of / 3,7 into Eq. (20 1 , and recalling that the step size 7 equals a A, we get that 
k — \\V^ lUrlll’ < e with probability at least 1 — [log 2 (l/e)] 5 , provided that 


7 < c5^\ , m > 


Ulog(2/5) 

rjX 


kmrf' + k\/mrf log(2/5) < d' 


for suitable positive constants c, c', c". 

To get the theorem statement, recall that the analysis we performed pertains to data whose squared norm 
is bounded by 1. By the reduction discussed at the beginning of the proof, we can apply it to data with 
squared norm at most r, by replacing A with A/r, and 7 with -qr, leading to the condition 




and establishing the theorem. 


m > 


d log(2/5) 
7 A 


krri'rf'r^ + rk\/mq^ log(2/5) < d' 


6.2 Proof of Theorem |2] 

The proof relies mainly on the techniques and lemmas of Section [6UJ used to prove Theorem [T] As done in 


Section 6.1 we will assume without loss of generality that r = max* ||xj |7 is at most 1, and then transform 


the bound to a bound for general r (see the discussion at the beginning of Subsection |6.L2| ) 

First, we extract the following result, which is essentially the first part of Lemma [TT] (for k = 1)\ 

Lemma 15. Let A, wt be as defined in Algorithm^ and suppose that 7 G [O, . Then 

Eq [1 - (vi,wt+i)^|wt,ws_i] < (1 - c 7 A(vi,w 4 ^) (l - (vi,' w*)^) + c' 7 ^, 

for some positive numerical constants c, d. 

Note that this bound holds regardless of what is w^-i, and in particular holds across different epochs 
of Algorithm [T] Therefore, it is enough to show that starting from some initial point wq, after sufficiently 
many stochastic updates as specified in line 6-10 of the algorithm (or in terms of the analysis, sufficiently 
many applications of Lemma 15 I, we end up with a point wy for which 1 — (vi, -wr) < |, as required. 
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Note that to simplify the notation, we will use here a single running index wq, wi, W2, ..., wy (whereas in 
the algorithm we restarted the indexing after every epoch). 

The proof is based on martingale arguments, quite similar to the ones in Subsection 6 . 1.2 but with slight 
changes. First, we let 

bt = l- 


to simplify notation. We note that bo = 1 — (vi, wq)^ is assumed fixed, whereas 6i, 62, • • • are random 
variables based on the sampling process. Lemma[^tells us that if rj is sufficiently small, and < 1 — ^ for 
some ^ G (0,1), then 

E[6t+i|6t] < (1 - cr/A^) + c'r/^. (21) 


for some numerical constants c, c'. 


Lemma 16. Let B be the event that < 1 — ^for all t = 0,1,... ,T. Then for certain positive numerical 
constants ci, C2, C3, if p < ciA, then 

E[bT\B] < (^(1-C2r/Aef+ C3^) . 

Proof. Using Eq. ( [ 27 ] ), we have for any bt satisfying event B that 

E[bt+i\bt,B] =E[bt+i\bt,bt+i < 1 - C] < E[bt+i\bt] < {l-cp\^)bt + dp^. 

Taking expectation over bt (conditioned on B), we get that 

E[6t+i|S] < E[{l-cp\i)bt + c!p^\B] 

= (1 - cpXd) E [bt\B] + cp^. 

Unwinding the recursion, we get 

T-l 

E[bT\B] < ii-cpXifbo + dp^Y.^^-^^^^'^' 

i=Q 

00 

< (1 — cpXdy^ + dp^ ^ (1 — cpXff 

i=0 

= {l-cpX(f + dp‘^^— < {1- cpX^f + -^. 

cpX^ c A4 


□ 

We now turn to prove that the event B assumed in Lemma[^indeed holds with high probability: 
Lemma 17. The following holds for certain positive numerical constants ci,C2, c^,: Ifp < ciA, then for any 

d e (0,1), if _ 

bo + C 2 Tp^ + ctis/Tp"^ log(l//3) < 1 - C, (22) 

then it holds with probability at least 1 — /? that 

bt < bo + C 2 Tp^ + ctt,y/Tp‘^ log(l//3) < 1-^ 

for all t = 0 , 1 ,... ,T. 
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Proof. To prove the lemma, we analyze the stochastic process bi,b2, ■ ■ ■, 6t> and use a concentration of 
measure argument. First, we collect the following facts: 

• ^0 < 1 ~ This directly follows from the assumption stated in the lemma. 

• E [bt+i\bt] < bt + dif^ for some constant d: By Eq. ( [2T] ), 

E[bt+i\Wt] < (1 - CT]XC)bt +dr]'^ < bt + dif. 


• I bt+i — bt I is bounded by crjfor some constant c: Applying Lemmaj^for the case k = 1 , and assuming 
V < 1 / 12 , 

\bt+i-bt\ = |(vi,wt+i)2 - (v,wt)2| < 

Armed with these facts, and using the maximal version of the Hoeffding-Azuma inequality iTTll . it follows 
that with probability at least 1 — it holds simultaneously for alH = 0,1,..., T that 

bt<bo + C2Trf + log(l// 3 ) 

for some constants 02,03. If the expression is indeed less than 1 — then we get that < 1 — ^ for all t, 
from which the lemma follows. □ 

Combining Lemma [T^ and Lemma [T 7 | and using Markov’s inequality, we get the following corollary: 

Lemma 18 . Let confidence parameters /?, 7 G ( 0 , 1 ) be fixed. Then for some positive numerical constants 
Cl, 02, 03, 0, o', if r] < oiA and 

bo + C2Tr]‘^ + cos/Trf log(l/^) < 1 - C, 
then with probability at least 1 — (/3 + 7), it holds that 


i-r < 1 ({1 - cnMf + c'f'j . 


We are now ready to prove our theorem. By Lemma 18 for any /?, 7 G (O, and any 


7 < min <j oi, —7^ and T> 


31 og(l/ 7 ) 
07 A^ 


such that bo + C2Tr]‘^ + coy/Trf log(l// 3 ) < 1 - 
we get with probability at least 1 — (/3 + 7) that 


( 23 ) 


7 


31og(l/7) ]_ 


br < - (1 - cpX^) + -j 


Using the inequality (1 — (l/x))“® < exp(—a), which holds for any x > 1 and any a, and taking x = 
l/{cr]X^) and a = 3 log(l/7), we can upper bound the above by 


1 

7 




- 7" + 77 ^ 
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and since we assume 7 < ^, this is at most Overall, we got that with probability at least 1 — /3 — 7 , 
&T < 5 ! and therefore 1 — (vi, ^ 5 as required. 


It remains to show that the parameter choices in Eq. (|23 1 can indeed be satisfied. First, we fix ^ 


(where we recall that 0 < C < (vi, wq)"^), which trivially ensures that = 1 — (vi, wq)"^ is at most 1 — 2 ^. 
Moreover, suppose we pick /3 = 7 in (0, exp(—1)), and rj, T so that 

c* 72AC^ 31og(l/7) 


r] < 


log2(l/7) 


T = 




(24) 


where c*, c'^ are sufficiently small constants so that the bounds on 77 , T in Eq. (231 are satisfied. This implies 
that the third bound in Eq. ( |2^ is also satisfied, since by plugging in the values / bounds of T and rj, and 
using the assumptions 7 = /? < exp(— 1 ) and ^ < 1 , we have 

bo + C 2 Tr]'^ + co^/Trf' log(l/ 7 ) 


^ 31 og(l/ 7 ) 

< 1 - 2^ + C2- r ] + C34 


< l- 2 e + C 2 - 

n ' 


c'A^ 


< 1 - 2 e + 


<log(l/7) 

3C2C 
ci 


+ C.o\ 


/ 31 og(l/ 7 ) 

^ 3 c * 72^2 


? 7 log(l/ 7 ) 



which is less than 1 — ^ if we pick c* sufficiently small compared to . 

To summarize, we get that for any 7 G (0,exp(—1)), by picking r\ as in Eq. (24i, we have that after 
T iterations (where T is specified in Eq. (p^), with probability at least 1 — 27 , we get wy such that 
1 — (vi, wr) < 1. Substituting <5 = 27 and ( = 2^, we get that if 


and r] satisfies 


(vi, Wo)^ > c > 0 , 

ci(5^AC^ 


r] < 


log'(2/5) 


(for some universal constant ci), then with probability at least 1 — 5, after 

C 2 log(2/5) 


T = 


r]XC 


stochastic iterations, we get a satisfactory point w-p. 

As discussed at the beginning of the proof, this analysis is valid assuming r = maxj ||xj||2 < 1 . By the 
reduction discussed at the beginning of Subsection 6.1.2[ we can get an analysis for any r by substituting 
A —)• A/r and rj —)• rjr. This means that we should pick rj satisfying 


r]r < 

and getting the required point after 

T = 


ci5^(A/r)C^ 


log'(2/5) 




ci5^AC^ 

r2 log^(2/5) ’ 


C 2 log(2/5) 


C 2 log(2/5) 

_{vr){X/r)C_ 


- 1 


Iterations. 
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6.3 Proof of Theorem |4] 


For simplicity of notation, we drop the a subscript from Fa, and refer simply to F. 

We first prove the following two auxiliary lemmas: 

Lemma 19. If A is a symmetric matrix, then the gradient of the function F{w) = — at some w 

equals 

(F(w)/ +A)w, 

||w|p 

and its Hessian equals 

where = B + B^ (i.e., a matrix B plus its transpose). 

Proof By the product and chain rules (using the fact that is a composition of w i—;■ || w|p and z i—)> ^), 

the gradient of F(w) = (w^^w) equals 

Wt^— fTJ () - (jdw) (25) 

||w|p V / ||w|p 

giving the gradient bound in the lemma statement after a few simplifications. 

Differentiating the vector-valued Eq. ( |25l ) with respect to w (using the product and chain rules, and the 
fact that is a composition of w i—)• ||w|p, 2 : 1 —)> z^, and 2 ; 1 —)> ^), we get that the Hessian of F equals 


(w''~^w) -|- w ( —-r—^ =t= 2||w||^ * 2w ) (w ' ) -|- w 


w 


w 


T 




( 2 Aw) 


-A, 


w 


- (Aw) - 


* 2w 


w 


2F(w) 8F(w) -r 

-n—+ -n—rrr-ww -|- 


4 T . 2 ^ 

—^ww'A - -— ^A + 


w 


w 


w 


r^dwW 


T 


W 


W 


W 


2F(w)I - 


8F(w) 


-ww — 


w 


4 T . . 4 . "r 

-TTTtWW a + 2A — - - TTTTjdwW 


w 


w 


which can be verified to equal the expression in the lemma statement (using the fact that A, ww^ and I are 
all symmetric matrices, hence equal their transpose). □ 

Lemma 20. Let wq, vi be two unit vectors such that ||wo — vi|| < e < ^ (which implies (wq, vi) > 0). 
Let be the intersection of the ray {avi : a > 0} with the hyperplane Ffwo = {w : (w, wq) = 1}. Then 
llv'i - Woll < fe. 

Proof See Figurej^in the main text for a graphical illustration. 

Letting = ov, a must satisfy (ovi, wq) = 1. Since vi, wq are unit vectors, this implies 

_ 1 _ 2 

(vi,wo) 2-||vi-woP’ 
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2 


and since ||vi — wo|| < e, this means that 


Therefore, 


a £ 


1 , 


2 -e^ 


||vi— wqII < ||vi — woll + ||v']^ — vi|| < e+||avi—vill < e + |a —1| < e + 
and since e < i, this is at most |e. 


2-e' 


-1 = e + 


2 + e' 


2 ’ 


□ 


We now turn to prove the theorem. Let (^v) denote the Hessian at some point ^v. To show smoothness 
and strong convexity as stated in the theorem, it is enough to fix some unit wq which is e-close to the leading 
eigenvector vi (where e is assumed to be sufficiently small), and show that for any point w on Tfwo which 
is 0{e) close to wq, and any direction g along (i-S- unit g such that (g,vv^o) = 0). it holds that 
g^V^(w)g £ [A, 20]. This implies that the second derivative in an 0{e) neighborhood of -wq on Tfwo is 
always in [A, 20], hence the function is both A-strongly convex in that neighborhood. 

More formally, letting e £ (0,1) be a small parameter to be chosen later, consider any wq such that 


I’Woll = 1 , ||wo - Vill < e, 


any w such that 
and any g such that 


(w — Wq, Wq) = 0 , jjw — WqI! < 2e, 


20 


the 


l|g|| = 1 , (g,wo) = 0. 

Our goal is to show that for an appropriate e, we have g^V^(w)g £ [A, 20]. Moreover, by Lemma 
neighborhood set Tfwo O i?wo(2e) would also contain a point avi for some a, which is a global optimum of 
F due to its scale-invariance. This would establish the theorem. 

The easier part is to show the upper bound on g^ (w)g. Since g is a unit vector, it is enough to bound 

the spectral norm of V^(w), which equals 


1 


w 


I- 


rWW 


T 


W 


< 


< 


< 


|W|| 

2 


I- 


rWW 


w 


F{w)I + A 
F{w)I + A 


sp 


Sp 


I W|| 
2 

I w|| 


I- 


rWW 


T 


W I 


||F(w)/ + ^| 


sp 


sp 


\ sp 


+ 


rWW 


w 


i\\Fiw)I\Up + 


sp ) 


sp , 


Since the spectral norm of A is 1, and Uwjj^ > 1 (as -w lies on a hyperplane Tfwo tangent to a unit vector 
vv^o)^ it is easy to verify that this is at most 2(1 + 4)(1 -|- 1) = 20 as required. 

We now turn to lower bound g^ V^(w)g, which by Lemma [l^ equals 
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Since = g^-Bg + g^B^g = 2g^Bg, the above equals 
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ww^ ) ( F(w)/ + A ) g. 
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(26) 


Using the fact that w = wq + (w — wq), and (g, wq) = 0, we get that (g, w) = (g, w — wq). Moreover, 
since A is positive semidefinite and has spectral norm of 1, F(w) = —G [—1,0]. Expanding Eq. (261 
and plugging these in, we get 


> 


|w|| 

2 

||w|| 

2 

llwll 


i"(w)gT (I- 


W I 


rWW^ I g + g^ ( I - 


W 


Ag 


-F(w)||g||^ + - wo)^ - g’^Ag + T-^(g, w - wo)w^Ag 


-F(w)||gf- 


|w| 

4 


w 


w 


|g|p||w-wo|p - g^^g- 


w 


|g|| ||W — Woll ||w 


sp 11 g I 


Since ||g|| = 1, ||^||sp = 1, jjw — wojj < 2e, and ||w|e = ||wo|f + jjw — wojp is between 1 and 1 + 4e^, 
this is at least 
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—F{w)) — 16e^ — g^^g — 8e\/l + 46^) = 
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—F(w) — g'''Ag — 8e(2e + \/l + 


(27) 

Eet us now analyze —F(w) and g^ Ag more carefully. The idea will be to show that since we are close 
to the optimum, —F(w) is very close to 1, and g (which is orthogonal to the near-optimal wq) is such that 
g^Ag is strictly smaller than 1. This would give us a positive lower bound on Eq. (27 1 . 

• By the triangle inequality and the assumptions jjwo —vijj < e, jjw —wojj < 2e, we have jjw —vijj < 
3e. Also, we claim that F(-) is 4-Eipschitz outside the unit Euclidean ball (since the gradient of F at 
any point with norm > 1, according to Eemma 19 has norm at most 4). Therefore, |F(w) + 1| = 
|F(w) — F(vi)| < 4||w — viII < 12e, so overall, 

F(w) < -1 + 12e. (28) 


• Since (wq, g) = 0, and || wq — vi || < e, it follows that 

|(vi,g)| < |(vi - wo,g)| + |(wo,g)| < ||vi - wo||||g||+0 < e. 


Eetting vi,..., and 1 = si > S 2 > .• > Sd > 0 be the eigenvectors and eigenvalues of A in 
decreasing order (and recalling that S 2 < si — A = 1 — A for some eigengap A > 0), we get 


d d 

g"^^g = < (vi,g)^ + (1 - A)^(vi,g)2 

i=l i=\ 

= (vi,g)^ + (1 - A)(l - (vi,g)2) = A(vi,g)2 + (1 - A) 

< Ae^ + (1 - A) = 1 - (1 - e^)\. (29) 
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Plugging Eq. (281 and Eq. (291 back into Eq. (27 1 , we get a lower bound of 
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1 - 12e - (l - (1 - e2)A) - 8e (2e + \/l + 4e2 
(^(1 - e2)A - 8e (^1.5 + 2e + \/l + 4e2 
8e(^1.5 + 2e + \/l + 4e2) 


1 - - 
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Using the fact that vT+^ <1 + 2 , this can be loosely lower bounded by 

2 /_ 8e(2.5 + 4e)' 
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1 -e- 


A 


A. 


Recalling that 11 w 11^ = 11 wq |P + 11 w — wq |P is at most 1 + 4e^, and picking e sufficiently small compared to 
A, (say e = A/44), we get that the above is at least A, which implies the required strong convexity condition. 

To summarize, by picking e = A/44, we have shown that the function F(w) is A-strongly convex and 
20 -smooth in a neighborhood of size 2e = ^ around wq on the hyperplane H-^q, provided that || wq— vi || < 
e = g. By Lemma 20 we are guaranteed that this neighborhood contains vi up to some rescaling (which 
is immaterial for our scale-invariant function F), hence by optimizing F in that neighborhood, we will get 
a globally optimal solution. 
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